Fast Approximate Search in Large Dictionaries

Authors

Stoyan Mihov

Klaus U. Schulz

Abstract

The need to correct garbled strings arises in many areas of natural language processing. If a dictionary is available that covers all possible input tokens, a natural set of candidates for correcting an erroneous input P is the set of all words in the dictionary for which the Levenshtein distance to P does not exceed a given (small) bound k. In this article we describe methods for efficiently selecting such candidate sets. After introducing as a starting point a basic correction method based on the concept of a “universal Levenshtein automaton,” we show how two filtering methods known from the field of approximate text search can be used to improve the basic procedure in a significant way. The first method, which uses standard dictionaries plus dictionaries with reversed words, leads to very short correction times for most classes of input strings. Our evaluation results demonstrate that correction times for fixed-distance bounds depend on the expected number of correction candidates, which decreases for longer input words. Similarly, the choice of an optimal filtering method depends on the length of the input words.
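
To make the candidate set from the abstract concrete, the following is a minimal brute-force sketch in Python. It is not the authors' method: it does not use a universal Levenshtein automaton or any filtering, and simply scans every dictionary word with a standard dynamic-programming distance check. The function names `levenshtein_within` and `candidates` are hypothetical, chosen only for illustration.

```python
from typing import Iterable, List


def levenshtein_within(p: str, w: str, k: int) -> bool:
    """Return True if the Levenshtein distance between p and w is at most k."""
    # The length difference is a lower bound on the distance.
    if abs(len(p) - len(w)) > k:
        return False
    # prev[j] holds the distance between the current prefix of p and w[:j].
    prev = list(range(len(w) + 1))
    for i, pc in enumerate(p, start=1):
        cur = [i]
        for j, wc in enumerate(w, start=1):
            cur.append(min(
                prev[j] + 1,               # delete pc
                cur[j - 1] + 1,            # insert wc
                prev[j - 1] + (pc != wc),  # substitute or match
            ))
        # Row minima never decrease, so the bound can no longer be met.
        if min(cur) > k:
            return False
        prev = cur
    return prev[-1] <= k


def candidates(p: str, dictionary: Iterable[str], k: int) -> List[str]:
    """Naive baseline: all dictionary words within distance k of input p."""
    return [w for w in dictionary if levenshtein_within(p, w, k)]


words = ["child", "hold", "cold", "world", "chord", "house"]
print(candidates("chold", words, 1))
```

This scan touches every word in the dictionary, which is the cost that the automaton-based and filtering approaches described in the abstract aim to avoid.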

Similar resources

Learning Better Encoding for Approximate Nearest Neighbor Search with Dictionary Annealing

We introduce a novel dictionary optimization method for high-dimensional vector quantization employed in approximate nearest neighbor (ANN) search. Vector quantization methods first seek a series of dictionaries, then approximate each vector by a sum of elements selected from these dictionaries. An optimal series of dictionaries should be mutually independent, and each dictionary should generat...

Fast Large-Scale Approximate Graph Construction for NLP

Many natural language processing problems involve constructing large nearest-neighbor graphs. We propose a system called FLAG to construct such graphs approximately from large data sets. To handle the large amount of data, our algorithm maintains approximate counts based on sketching algorithms. To find the approximate nearest neighbors, our algorithm pairs a new distributed online-PMI algorith...

Advances in multirate filter bank structures and multiscale representations

We propose a new framework to extract the activity-related component in the BOLD functional Magnetic Resonance Imaging (fMRI) signal. As opposed to traditional fMRI signal analysis techniques, we do not impose any prior knowledge of the event timing. Instead, our basic assumption is that the activation pattern is a sequence of short and sparsely-distributed stimuli, as is the case in slow event...

A Breadth-First Representation for Tree Matching in Large Scale Forest-Based Translation

Efficient data structures are necessary for searching large translation rule dictionaries in forest-based machine translation. We propose a breadth-first representation of tree structures that allows trees to be stored and accessed efficiently. We describe an algorithm that allows incremental search for trees in a forest and show that its performance is orders of magnitude faster than iterative...

Efficient Similarity Search via Sparse Coding

This work presents a new indexing method using sparse coding for fast approximate Nearest Neighbors (NN) on high dimensional image data. To begin with we sparse code the data using a learned basis dictionary and an index of the dictionary’s support set is next used to generate one compact identifier for each data point. As basis combinations increase exponentially with an increasing support set...

Journal:

Computational Linguistics

Volume 30

Publication date: 2004
